Skip to content

[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744

Open
qinghui-xu wants to merge 2 commits intoapache:masterfrom
qinghui-xu:FLINK-21309
Open

[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744
qinghui-xu wants to merge 2 commits intoapache:masterfrom
qinghui-xu:FLINK-21309

Conversation

@qinghui-xu
Copy link
Contributor

What is the purpose of the change

Previously we are relying on random uuid suffix of job label to avoid metric collisions among taskmangers. This makes it difficult to aggregate over job label.
Now we use the uuid as grouping key (flink_prometheus_push_gateway_reporter_id) instead, while respecting prometheus guide line on job label usage.

Brief change log

  • Remove the randomJobNameSuffix flag from the config options
  • job label will not be suffixed
  • Systematically add flink_prometheus_push_gateway_reporter_id = UUID as last grouping key in all case.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change added tests and can be verified as follows:

  • Added unit test to make sure that flink_prometheus_push_gateway_reporter_id = UUID is appended to user provided grouping keys.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 6, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Previously we are relying on random uuid suffix of `job` label to avoid metric collisions among taskmangers.
This makes it difficult to aggregate over `job` label.
Now we use the uuid as grouping key instead, while respecting prometheus guide line on `job` label usage.
@qinghui-xu
Copy link
Contributor Author

@flinkbot run azure

*/
@PublicEvolving
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter implements Scheduled {
public static final String REPORTER_ID_GROUPPING_KEY =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor spelling nit: I think we want this to be called `REPORTER_ID_GROUPING_KEY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.defaultValue("")
.withDescription("The job name under which metrics will be pushed");

public static final ConfigOption<Boolean> RANDOM_JOB_NAME_SUFFIX =
Copy link
Contributor

@rionmonster rionmonster Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to remove this configuration entirely? I’d imagine there are existing jobs and/or users already relying on it. This would also require updating the associated documentation and, at a minimum, feels like something that should be deprecated before being removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Let me deprecate it in the PR. I'm wondering should I also change the default value to false while keeping it. I'll use this flag as a switch: if job is not suffixed, a "flink_prometheus_push_gateway_reporter_id" grouping key will be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sent a followup commit in this sense.


GroupingKeyMap(Map<String, String> customGroupingKeys) {
if (customGroupingKeys.containsKey(REPORTER_ID_GROUPPING_KEY)) {
throw new IllegalArgumentException(
Copy link
Contributor

@rionmonster rionmonster Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to add a test here to cover the assertion against the user supplying "flink_prometheus_push_gateway_reporter_id" as an explicit grouping (asserting the expected exception).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Mar 9, 2026
- Keep `randomJobNameSuffix` config as deprecated
- When `randomJobNameSuffix` is disabled (by default), group metrics using random reporter id
* grouping keys, so that each taskmanger instance and jobmanager will not collide their metrics
* in the PushGateway.
*/
static class GroupingKeyMap extends AbstractMap<String, String> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The custom GroupingKeyMap feels heavier than necessary for what is effectively appending one synthetic entry.

Since the goal is just to preserve user-defined grouping keys and add REPORTER_ID_GROUPING_KEY as the last entry, would a LinkedHashMap copy be simpler and easier to reason about?

For example:

Map<String, String> grouping = new LinkedHashMap<>(groupingKey);
grouping.put(REPORTER_ID_GROUPING_KEY, reporterId);

That avoids reimplementing AbstractMap, iterator behavior, and custom entrySet() semantics.

"Grouping keys must not contain the reserved key: "
+ REPORTER_ID_GROUPPING_KEY);
}
this.customGroupingKeys = customGroupingKeys;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to defensively copy customGroupingKeys here?

Using new LinkedHashMap<>(customGroupingKeys) would avoid accidental external mutation after construction and also guarantees stable iteration order.

ConfigOptions.key("randomJobNameSuffix")
.booleanType()
.defaultValue(true)
.defaultValue(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default from true to false introduces a behavioral change for existing deployments.

Even though the option is deprecated, preserving the current default for one deprecation cycle may be safer, then switching defaults in a major compatibility window.

.defaultValue(false)
.withDescription(
"Specifies whether a random suffix should be appended to the job name.");
"Specifies whether random suffixing `job` label (the old way) to avoid metric collision among reporters from taskamangers."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option description could use a wording pass:

  • taskamangers -> taskmanagers
  • random suffixing job label reads a bit awkwardly

Current wording is understandable, but polishing it would improve config readability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants